Abstract

We consider estimation of the structural distribution function of the cell probabilities 
of a multinomial sample in situations where the number of cells is large. We review the 
performance of the natural estimator, an estimator based on grouping the cells and a 
kernel type estimator. Inconsistency of the natural estimator and weak consistency of the
other two estimators are derived by Poissonization and other, new, technical devices.
1 The structural distribution function

Let the vector $X = (X_1, \ldots, X_M)$ denote a mult$(n, p_M)$ distributed random vector, where
$p_M = (p_{M1}, p_{M2}, \ldots, p_{MM})$ is the vector of cell probabilities. Hence, the nonnegative
components of $p_M$ satisfy $p_{M1} + \cdots + p_{MM} = 1$.

We will consider situations where $M = M_n$ is large with respect to $n$, i.e.

$$ M/n \not\to 0, \quad \text{as } n \to \infty. \tag{1.1} $$

*Financed by INTAS-97-Georgia-1828 
In these cases $X/n$ does not estimate $p_M$ accurately. For instance, for the average mean squared
error in estimating $M p_{Mi}$, $i = 1, \ldots, M$, we have

$$ E\,\frac{1}{M}\sum_{i=1}^{M}\Big(\frac{M}{n}X_i - M p_{Mi}\Big)^2 = \frac{M}{n}\sum_{i=1}^{M} p_{Mi}(1 - p_{Mi}) = \frac{M}{n}\Big(1 - \sum_{i=1}^{M} p_{Mi}^2\Big) \not\to 0, $$

unless $\sum_{i=1}^{M} p_{Mi}^2 \to 1$ holds, i.e. unless $p_M$ comes close to a unit vector $(0, \ldots, 0, 1, 0, \ldots, 0)$.
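The mean squared error identity above can be checked numerically. The following is a minimal sketch, assuming only that $\text{Var}(X_i) = n p_{Mi}(1 - p_{Mi})$ for multinomial counts; the function names and the uniform example vector are illustrative and not part of the paper.

```python
def average_mse(p, n):
    """Exact average MSE of M*X_i/n as an estimator of M*p_i, summed term by term."""
    M = len(p)
    # Var(X_i) = n p_i (1 - p_i), hence E(M X_i/n - M p_i)^2 = (M/n)^2 * n p_i (1 - p_i).
    return sum((M / n) ** 2 * n * pi * (1 - pi) for pi in p) / M


def closed_form(p, n):
    """The closed form (M/n) * (1 - sum_i p_i^2) from the display above."""
    M = len(p)
    return (M / n) * (1 - sum(pi * pi for pi in p))


# Hypothetical example: M = 1000 equally likely cells and n = 2000 observations.
uniform = [1 / 1000] * 1000
print(average_mse(uniform, 2000), closed_form(uniform, 2000))

# A unit vector gives zero error, in line with the exception noted in the text.
unit = [0.0, 1.0, 0.0]
print(closed_form(unit, 10))
```

For the uniform vector both expressions are close to $M/n = 0.5$, so the error does not vanish however large $n$ is, as long as $M/n$ stays away from zero.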

However, there are characteristics of $p_M$ that can be estimated consistently. Here we will
study the structural distribution function of $p_M$. It is defined as the empirical distribution
function of the $M p_{Mi}$, $i = 1, \ldots, M$, and it is given by

$$ F_M(x) = \frac{1}{M}\sum_{i=1}^{M} 1_{[M p_{Mi} \le x]}, \quad x > 0. \tag{1.2} $$

Our basic assumption will be that $F_M$ converges weakly to a limit distribution function $F$, i.e.

$$ F_M \stackrel{w}{\to} F, \quad \text{as } n \to \infty. \tag{1.3} $$

The basic estimation problem is how to estimate $F_M$ (or $F$) from an observation of $X$.

A rule of thumb in statistics is to replace unknown probabilities by sample fractions. This
yields the so-called natural estimator. This estimator, denoted by $\hat F_M$, is equal to the empirical
distribution function based on $M$ times the cell fractions $X_i/n$, so

$$ \hat F_M(x) = \frac{1}{M}\sum_{i=1}^{M} 1_{[\frac{M}{n}X_i \le x]}. \tag{1.4} $$

This estimator has often been used in linguistics, but turns out to be inconsistent for estimating
$F$; see Section 5.1, Khmaladze (1988), and Klaassen and Mnatsakanov (2000).

Our estimation problem is related to estimation in sparse multinomial tables. For recent 
results on the estimation of cell probabilities in this context see Aerts, Augustyns and Janssen 
(2000). 

In Section 2 we present a small simulation study of a typical multinomial sample and the
behavior of the natural estimator. It turns out that smoothing is required to obtain weakly
consistent estimators. An estimator based on grouping and an estimator based on kernel
smoothing are presented in Section 3. Section 4 deals with the technique of Poissonization and
with the relation between weak and $L_1$ consistency. These basic results are used in the weak
consistency proofs in Section 5. Section 6 contains a discussion.

2 A simulation 

We have simulated a sample with $M = 1000$ and $n = 2000$. The cell probabilities are generated
via

$$ p_{Mi} = G(i/M) - G((i - 1)/M), \quad i = 1, \ldots, M. \tag{2.5} $$
The distribution function $G$ and its density $g$ have been chosen equal to the functions

$$ g(x) = 30x^2(1 - x)^2 \quad \text{and} \quad G(x) = 10x^3 - 15x^4 + 6x^5, \quad 0 \le x \le 1. \tag{2.6} $$
In Section 3 we show that for these cell probabilities, the limit structural distribution function
$F$ from (1.3) is equal to the distribution function of $g(U)$, with $U$ a uniform(0,1) random
variable. Here it is given by

$$ F(x) = 1 - \sqrt{1 - 4\sqrt{x/30}}, \quad 0 \le x \le 15/8. \tag{2.7} $$

These functions are drawn in Figure 1.

Figure 1: The function g and the corresponding structural distribution function F.

For this simulated sample we have plotted the cell counts, multiplied by $M/n$, and the natural
estimate in Figure 2. Comparison with the real $F$ in Figure 1 clearly illustrates the inconsistency
of the natural estimator.
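The simulation can be sketched in a few lines. The code below is an illustration under stated assumptions and not the authors' program: it uses only the Python standard library, the seed is arbitrary, and the closed form used for $F$ is obtained by solving $30U^2(1-U)^2 \le x$ for a uniform $U$, which gives $P(g(U) \le x) = 1 - \sqrt{1 - 4\sqrt{x/30}}$ on $[0, 15/8]$.

```python
import math
import random
from collections import Counter


def G(x):
    """Distribution function with density g(x) = 30 x^2 (1 - x)^2 on [0, 1]."""
    return 10 * x**3 - 15 * x**4 + 6 * x**5


def limit_F(x):
    """Distribution function of g(U), U uniform(0,1), derived by hand (see lead-in)."""
    if x <= 0:
        return 0.0
    if x >= 30 / 16:
        return 1.0
    return 1 - math.sqrt(max(0.0, 1 - 4 * math.sqrt(x / 30)))


def natural_estimator(counts, n, x):
    """Empirical distribution function of the scaled counts M * X_i / n."""
    M = len(counts)
    return sum(1 for c in counts if M * c / n <= x) / M


M, n = 1000, 2000
p = [G(i / M) - G((i - 1) / M) for i in range(1, M + 1)]

rng = random.Random(0)  # arbitrary seed, chosen only for reproducibility
draws = Counter(rng.choices(range(M), weights=p, k=n))
counts = [draws.get(i, 0) for i in range(M)]

# The scaled counts only take the values 0, 0.5, 1, ..., so the natural estimate
# is a step function with jumps at multiples of M/n = 0.5, unlike the smooth F.
for x in (0.25, 0.75, 1.25, 1.75):
    print(x, natural_estimator(counts, n, x), round(limit_F(x), 4))
```

In particular, the value of the natural estimate just below $M/n$ equals the fraction of empty cells, which has no direct counterpart in $F$.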




Figure 2: The function g, M/n times the cell counts, and in the second figure the natural 
estimator of F. 



3 Estimators based on smoothing techniques 

Up to now we have only assumed that the structural distribution function $F_M$ converges weakly
to a limit distribution function $F$. From now on we will assume more structure.

Consider the function

$$ g_M(u) = \sum_{i=1}^{M} M p_{Mi}\, 1_{(\frac{i-1}{M}, \frac{i}{M}]}(u), \quad u \in \mathbb{R}. \tag{3.8} $$

This step function is a density representing the cell probabilities and we shall call it the parent
density. The relation between this parent density $g_M$ and the structural distribution function
$F_M$ is given by the fact that if $U$ is a uniform(0,1) random variable then $F_M$ is the distribution
function of $g_M(U)$. Note that

$$ E\, g_M(U) = \int_0^1 g_M(u)\,du = \sum_{i=1}^{M} p_{Mi} = 1, \tag{3.9} $$

so $g_M$ is a probability density indeed.

We will assume that there exists a limiting parent density $g$ on $[0,1]$ such that, as $n \to \infty$,

$$ \sup_{0<u<1} |g_M(u) - g(u)| \to 0. \tag{3.10} $$

Consequently we have $g_M(U) \to g(U)$, almost surely, and hence $F_M \stackrel{w}{\to} F$.

The inconsistency of the natural estimator can be lifted by first smoothing the cell counts
$X_i$. We consider two smoothing methods: grouping, which is actually a kind of histogram
smoothing, and a method based on kernel smoothing of the counts.

3.1 Grouping 

Let $m$ and $k_j$, $j = 0, 1, \ldots, m$, be integers, all depending on $n$, such that
$0 = k_0 < k_1 < \cdots < k_m = M$. Define the group frequencies $\bar X_j$ as

$$ \bar X_j = \sum_{i=k_{j-1}+1}^{k_j} X_i, \quad j = 1, \ldots, m. \tag{3.11} $$

Then the vector of grouped counts $\bar X$ is again multinomially distributed,

$$ \bar X = (\bar X_1, \ldots, \bar X_m) \sim \text{mult}(n, q_m), \tag{3.12} $$

where $q_m = (q_{m1}, \ldots, q_{mm})$ and

$$ q_{mj} = \sum_{i=k_{j-1}+1}^{k_j} p_{Mi}, \quad j = 1, \ldots, m. \tag{3.13} $$

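The grouped cells estimator, introduced in Klaassen and Mnatsakanov (2000), is defined by

$$ \hat F_M(x) = \frac{1}{M}\sum_{j=1}^{m} (k_j - k_{j-1})\, 1_{[\frac{M \bar X_j}{n(k_j - k_{j-1})} \le x]}, \quad x > 0. \tag{3.14} $$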
This estimator may be viewed as a structural distribution function with parent density

$$ \hat g_M(u) = \sum_{j=1}^{m} \frac{M \bar X_j}{n(k_j - k_{j-1})}\, 1_{(\frac{k_{j-1}}{M}, \frac{k_j}{M}]}(u). \tag{3.15} $$

This histogram is an estimator of the limiting parent density $g$ in (3.10). We will prove weak
consistency of the corresponding estimator $\hat F_M$ in Section 5.2.
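As an illustration, the grouped cells estimator can be computed as follows. This is a minimal sketch; the function name and the representation of the group boundaries as a list $[k_0, k_1, \ldots, k_m]$ are assumptions made for the example.

```python
def grouped_estimator(counts, n, boundaries, x):
    """Grouped cells estimator at x.

    counts     -- cell counts X_1, ..., X_M (0-based list)
    n          -- sample size
    boundaries -- [k_0, k_1, ..., k_m] with 0 = k_0 < k_1 < ... < k_m = M
    """
    M = len(counts)
    weight = 0
    for lo, hi in zip(boundaries, boundaries[1:]):
        size = hi - lo                      # k_j - k_{j-1}
        group_count = sum(counts[lo:hi])    # cells k_{j-1}+1, ..., k_j in 1-based indexing
        if M * group_count / (n * size) <= x:
            weight += size                  # each group is weighted by its number of cells
    return weight / M


# Tiny hand-checkable example: M = 4 cells, n = 4 observations, two groups of size 2.
counts = [1, 0, 2, 1]
print(grouped_estimator(counts, 4, [0, 2, 4], 0.5))  # group values are 0.5 and 1.5
print(grouped_estimator(counts, 4, [0, 2, 4], 1.5))
```

With groups of size one, i.e. boundaries $0, 1, \ldots, M$, the function reduces to the natural estimator (1.4).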

For our simulated example the estimates of $g$ and $F$ resulting from grouping with equal
group size $k = 50$ are given in Figure 3.

Figure 3: g, F, and estimates $\hat g_M$ and $\hat F_M$ by grouping with equal cell size.



3.2 A kernel type estimator 

Now that we have seen that the estimator based on the grouped cell counts is in fact based on
a histogram estimate of the parent density $g$, we might also use kernel smoothing to estimate $g$
and proceed in a similar manner. If we choose a probability density $w$ as kernel function and
a bandwidth $k > 0$, we get the following estimator for the parent density $g$

$$ \hat g_M(u) = \frac{M}{nk}\sum_{j=1}^{M}\sum_{i=1}^{M} w\Big(\frac{j-i}{k}\Big) X_i\, 1_{(\frac{j-1}{M}, \frac{j}{M}]}(u). \tag{3.16} $$

As an estimator for the structural distribution function $F$ we take the empirical
distribution function of $\hat g_M(U)$ with $U$ uniform, namely

$$ \hat F_M(x) = \frac{1}{M}\sum_{j=1}^{M} 1_{[\frac{M}{nk}\sum_{i=1}^{M} w(\frac{j-i}{k}) X_i \le x]}. \tag{3.17} $$

Weak consistency of this estimator will be derived in Section 5.3.

For our simulated example kernel estimates $\hat g_M$ and $\hat F_M$ of $g$ and $F$, respectively, with $k$
equal to 50 are given in Figure 4.
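A kernel type estimator of this kind can be sketched as follows. The sketch assumes that the smoothed value attached to cell $j$ is $\frac{M}{nk}\sum_{i} w\big(\frac{j-i}{k}\big) X_i$, a form that is consistent with the conditions $k \to \infty$ and $M/(nk) \to 0$ of Theorem 5.2 but should be checked against the original display; the triangular kernel and all function names are hypothetical choices.

```python
def triangular(t):
    """Triangular kernel on [-1, 1]; a hypothetical choice of the density w."""
    return max(0.0, 1.0 - abs(t))


def kernel_parent_density(counts, n, k, w):
    """Smoothed value (M / (n k)) * sum_i w((j - i) / k) * X_i for every cell j."""
    M = len(counts)
    return [
        M / (n * k) * sum(w((j - i) / k) * counts[i] for i in range(M))
        for j in range(M)
    ]


def kernel_estimator(counts, n, k, w, x):
    """Empirical distribution function of the smoothed cell values."""
    values = kernel_parent_density(counts, n, k, w)
    return sum(1 for v in values if v <= x) / len(values)


# Tiny hand-checkable example: M = 3 cells, n = 4 observations, bandwidth k = 2.
counts = [0, 4, 0]
print(kernel_parent_density(counts, 4, 2, triangular))
print(kernel_estimator(counts, 4, 2, triangular, 1.0))
```

The direct double loop costs $O(M^2)$ operations, which is acceptable for $M = 1000$; for a kernel with bounded support the inner sum could be restricted to the cells within distance $k$ of cell $j$.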



4 Relevant techniques 

In our proofs we shall repeatedly use the powerful method of Poissonization and a device
involving $L_1$ convergence.

Figure 4: F, and estimates $\hat g_M$ and $\hat F_M$ by kernel smoothing.



4.1 Poissonization 

Consider the random vectors $X$ and $Y$, with

$$ X = (X_1, \ldots, X_M) \sim \text{mult}(n, p_M) \quad \text{and} \quad Y = (Y_1, \ldots, Y_M), \quad Y_i \sim \text{Poisson}(n p_{Mi}), \tag{4.18} $$

where $Y_1, \ldots, Y_M$ are independent. Note

$$ N = \sum_{i=1}^{M} Y_i \sim \text{Poisson}(n). \tag{4.19} $$

Given $N = k$ the random vector $Y$ has a mult$(k, p_M)$ distribution.

Based on an infinite sequence of mult$(1, p_{M1}, \ldots, p_{MM})$ random vectors one can construct
vectors $X$ and $Y$, the cell counts over $n$ and $N$ of these vectors respectively, with the distributions
(4.18). Given $N = k$ they are coupled as follows

$$ \begin{aligned} k \le n &: \quad X = Y + \text{mult}(n - k, p_M), \\ k > n &: \quad Y = X + \text{mult}(k - n, p_M). \end{aligned} \tag{4.20} $$

Note that this shows that either $X_i \le Y_i$ for all $i$ or $X_i \ge Y_i$ for all $i$.
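The coupling can be made concrete with a short simulation. The sketch below is an illustration only: it draws one common sequence of cells, counts the first $n$ draws for $X$ and the first $N$ draws for $Y$, and generates $N \sim \text{Poisson}(n)$ as the number of arrivals of a unit rate Poisson process in $[0, n]$. The function names, the seed and the example probabilities are assumptions.

```python
import random


def poisson(rng, lam):
    """Poisson(lam) variate: number of rate-1 exponential arrivals in [0, lam]."""
    count, t = 0, rng.expovariate(1.0)
    while t <= lam:
        count += 1
        t += rng.expovariate(1.0)
    return count


def coupled_counts(rng, p, n):
    """Return (X, Y, N): X counts the first n draws, Y the first N draws of one sequence."""
    M = len(p)
    N = poisson(rng, n)
    draws = rng.choices(range(M), weights=p, k=max(n, N))
    X, Y = [0] * M, [0] * M
    for t, cell in enumerate(draws):
        if t < n:
            X[cell] += 1
        if t < N:
            Y[cell] += 1
    return X, Y, N


rng = random.Random(1)  # arbitrary seed
X, Y, N = coupled_counts(rng, [0.2, 0.3, 0.5], 50)
print(N, X, Y)

# Either every X_i <= Y_i or every X_i >= Y_i, so sum_i |X_i - Y_i| = |N - n|.
print(sum(abs(x - y) for x, y in zip(X, Y)), abs(N - 50))
```

The identity $\sum_i |X_i - Y_i| = |N - n|$ printed in the last line is what makes the coupling useful when bounding the difference between an estimator and its Poissonized version.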



4.2 Convergence in $L_1$ and weak convergence

An important step in the (in)consistency proofs is to show that "Poissonization is allowed",
i.e. that we can transfer the limit result for the estimator based on the Poissonized sample, the
"Poissonized version", to the original estimator. The following proposition is used repeatedly,
also if no Poissonized version is involved.

Proposition 4.1 Let $F$ be a distribution function and let $\hat F_n$ and $\tilde F_n$ be possibly random
distribution functions. If

$$ \tilde F_n \stackrel{w}{\to} F, \quad \text{in probability}, \tag{4.21} $$

and

$$ \int |\hat F_n - \tilde F_n| \stackrel{P}{\to} 0 \tag{4.22} $$

hold, then

$$ \hat F_n \stackrel{w}{\to} F, \quad \text{in probability}, \tag{4.23} $$

is valid, i.e. for all $\epsilon > 0$ and all continuity points $x_0$ of $F$

$$ P(|\hat F_n(x_0) - F(x_0)| > \epsilon) \to 0. \tag{4.24} $$

In the special case where $\tilde F_n$ equals $F$, the proposition states that $L_1$ convergence implies weak
convergence.

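Proof Note that for all $x_0$ and all $\delta > 0$ we have

$$ \int_{x_0-\delta}^{x_0+\delta} |\hat F_n - F| \le \int_{x_0-\delta}^{x_0+\delta} |\tilde F_n - F| + \int_{-\infty}^{\infty} |\hat F_n - \tilde F_n|. \tag{4.25} $$

Let $x_0$ denote an arbitrary continuity point of $F$ and $\epsilon$ an arbitrary positive number. Choose
$\delta > 0$ such that $F(x_0 + \delta) - F(x_0 - \delta) < \epsilon$ and such that $x_0 - \delta$ and $x_0 + \delta$ are continuity points
of $F$. Then

$$ |\tilde F_n(x_0 - \delta) - F(x_0 - \delta)| \le \epsilon \quad \text{and} \quad |\tilde F_n(x_0 + \delta) - F(x_0 + \delta)| \le \epsilon \tag{4.26} $$

imply

$$ \int_{x_0-\delta}^{x_0+\delta} |\tilde F_n - F| \le 4\delta\epsilon. \tag{4.27} $$

Hence, we have

$$ P\Big(\int_{x_0-\delta}^{x_0+\delta} |\tilde F_n - F| > 4\delta\epsilon\Big) \le P(|\tilde F_n(x_0 - \delta) - F(x_0 - \delta)| > \epsilon) + P(|\tilde F_n(x_0 + \delta) - F(x_0 + \delta)| > \epsilon) \tag{4.28} $$

and, by (4.21),

$$ \int_{x_0-\delta}^{x_0+\delta} |\tilde F_n - F| \stackrel{P}{\to} 0. \tag{4.29} $$

Consequently, by (4.22) and (4.25) we get

$$ \int_{x_0-\delta}^{x_0+\delta} |\hat F_n - F| \stackrel{P}{\to} 0. \tag{4.30} $$

Choose $0 < \delta' < \delta$ such that $F(x_0 + \delta') < F(x_0) + \frac{1}{2}\epsilon$ and $F(x_0 - \delta') > F(x_0) - \frac{1}{2}\epsilon$. Then
we see

$$ |\hat F_n(x_0) - F(x_0)| > \epsilon \ \Rightarrow\ \int_{x_0-\delta'}^{x_0+\delta'} |\hat F_n - F| > \tfrac{1}{2}\delta'\epsilon \tag{4.31} $$

and hence

$$ P(|\hat F_n(x_0) - F(x_0)| > \epsilon) \le P\Big(\int_{x_0-\delta'}^{x_0+\delta'} |\hat F_n - F| > \tfrac{1}{2}\delta'\epsilon\Big) \le P\Big(\int_{x_0-\delta}^{x_0+\delta} |\hat F_n - F| > \tfrac{1}{2}\delta'\epsilon\Big) \to 0. $$

Since this holds for arbitrary continuity points $x_0$ and arbitrary $\epsilon > 0$ we have established
$\hat F_n \stackrel{w}{\to} F$, in probability. □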
5 Consistency

5.1 The natural estimator 



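The basic trick in dealing with the difference of the natural estimator and its Poissonized
version,

$$ \tilde F_M(x) = \frac{1}{M}\sum_{i=1}^{M} 1_{[\frac{M}{n}Y_i \le x]}, \tag{5.32} $$

uses the coupling as in (4.20) and is given by the following string of inequalities

$$ |\hat F_M(x) - \tilde F_M(x)| \le \frac{1}{M}\sum_{i=1}^{M} \big| 1_{[\frac{M}{n}X_i \le x]} - 1_{[\frac{M}{n}Y_i \le x]} \big| \le \frac{1}{M}\sum_{i=1}^{M} 1_{[X_i \ne Y_i]} \le \frac{1}{M}\sum_{i=1}^{M} |X_i - Y_i| = \frac{|N - n|}{M}. \tag{5.33} $$

Since $E|N - n| \le \sqrt{n}$, by (1.1) the right hand side converges to zero in probability and this
shows that Poissonization is allowed.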

Because of the independence of the Poisson counts $Y_i$ we can easily bound the variance of
the Poissonized estimator. We get

$$ \text{Var}\, \tilde F_M(x) = \text{Var}\, \frac{1}{M}\sum_{i=1}^{M} 1_{[\frac{M}{n}Y_i \le x]} \le \frac{1}{4M} \to 0. \tag{5.34} $$

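We also have

$$ E\, \tilde F_M(x) = \frac{1}{M}\sum_{i=1}^{M} P\Big(\frac{M}{n}Y_i \le x\Big) \ne \frac{1}{M}\sum_{i=1}^{M} 1_{[M p_{Mi} \le x]} = F_M(x) \tag{5.35} $$

and

$$ E\int x^2\, d\tilde F_M(x) = \frac{1}{M}\sum_{i=1}^{M} \Big(\frac{M}{n}\Big)^2 E\,Y_i^2 = \frac{1}{M}\sum_{i=1}^{M} \Big(\frac{M}{n}\Big)^2 \{n p_{Mi} + (n p_{Mi})^2\} = \frac{M}{n} + \int x^2\, dF_M(x). \tag{5.36} $$

Together with (1.1) this gives two reasons why $\tilde F_M(x)$ is probably not a consistent estimator of
$F$. Then, by (5.33) the natural estimator has to be inconsistent too.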
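Both effects can be computed exactly for a given vector of cell probabilities. The sketch below is an illustration under the assumption of independent counts $Y_i \sim \text{Poisson}(n p_{Mi})$; the function names and the small uniform example are hypothetical.

```python
import math


def poisson_cdf(lam, t):
    """P(Poisson(lam) <= t) for a real threshold t."""
    if t < 0:
        return 0.0
    term = math.exp(-lam)
    total = term
    for j in range(1, math.floor(t) + 1):
        term *= lam / j
        total += term
    return total


def expected_poissonized(p, n, x):
    """Expectation of the Poissonized natural estimator at x."""
    M = len(p)
    return sum(poisson_cdf(n * pi, n * x / M) for pi in p) / M


def structural_df(p, x):
    """Structural distribution function F_M at x."""
    M = len(p)
    return sum(1 for pi in p if M * pi <= x) / M


def second_moment_poissonized(p, n):
    """Expected second moment of the Poissonized estimator, using E Y^2 = lam + lam^2."""
    M = len(p)
    return sum((M / n) ** 2 * (n * pi + (n * pi) ** 2) for pi in p) / M


def second_moment_structural(p):
    """Second moment of F_M, i.e. the average of (M p_i)^2."""
    M = len(p)
    return sum((M * pi) ** 2 for pi in p) / M


# Hypothetical example: M = 4 equally likely cells, n = 8, so n/M = 2.
p, n = [0.25] * 4, 8
print(expected_poissonized(p, n, 1.0), structural_df(p, 1.0))
print(second_moment_poissonized(p, n), len(p) / n + second_moment_structural(p))
```

In this example $F_M(1) = 1$ while the expectation of the Poissonized estimator at $x = 1$ is $5e^{-2}$, and the second moment exceeds that of $F_M$ by exactly $M/n$.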

The inconsistency of the natural estimator of the structural distribution function has been
established in Khmaladze (1988), Khmaladze and Chitashvili (1989), Klaassen and Mnatsakanov
(2000) and Van Es and Kolios (2002). In these papers the situation of a large number of rare
events is considered, i.e. $n/M \to \lambda$ for some constant $\lambda$. The explicit limit in probability of
$\hat F_M(x)$ then turns out to be a Poisson mixture of $F$.
5.2 Grouping

Under the additional assumption $n/M \to \lambda$, for some constant $\lambda$, weak consistency of the
estimator based on grouped cells has been proved, without using Poissonization, by Klaassen
and Mnatsakanov (2000) and by the Poissonization method for the simpler case of equal group
size, i.e. $k_j = jk$, by Van Es and Kolios (2002). We shall prove the following generalization
without using Poissonization.



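Theorem 5.1 If $m/n \to 0$,

$$ \sup_{1 \le j \le m} \frac{k_j - k_{j-1}}{M} \to 0, \tag{5.37} $$

and

$$ \sup_{0<u<1} |g_M(u) - g(u)| \to 0 \tag{5.38} $$

are valid for some limiting parent density $g$ that is continuous on $[0, 1]$, then

$$ \hat F_M \stackrel{w}{\to} F, \quad \text{in probability}, \tag{5.39} $$

holds with

$$ \hat F_M(x) = \frac{1}{M}\sum_{j=1}^{m} (k_j - k_{j-1})\, 1_{[\frac{M \bar X_j}{n(k_j - k_{j-1})} \le x]}. \tag{5.40} $$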



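Proof

The estimator $\hat F_M$ behaves asymptotically as

$$ \bar F_M(x) = \frac{1}{M}\sum_{j=1}^{m} (k_j - k_{j-1})\, 1_{[\frac{M q_{mj}}{k_j - k_{j-1}} \le x]}. \tag{5.41} $$

Indeed, in view of $\int |1_{[a \le x]} - 1_{[b \le x]}|\,dx = |b - a|$ we have

$$ \begin{aligned} \int |\hat F_M(x) - \bar F_M(x)|\,dx &\le \int \frac{1}{M}\sum_{j=1}^{m} (k_j - k_{j-1}) \Big| 1_{[\frac{M \bar X_j}{n(k_j - k_{j-1})} \le x]} - 1_{[\frac{M q_{mj}}{k_j - k_{j-1}} \le x]} \Big|\,dx \\ &= \sum_{j=1}^{m} \frac{k_j - k_{j-1}}{M} \Big| \frac{M \bar X_j}{n(k_j - k_{j-1})} - \frac{M q_{mj}}{k_j - k_{j-1}} \Big| = \frac{1}{n}\sum_{j=1}^{m} |\bar X_j - n q_{mj}|. \end{aligned} $$

Consequently, we obtain

$$ \begin{aligned} E\int |\hat F_M(x) - \bar F_M(x)|\,dx &\le \frac{1}{n}\sum_{j=1}^{m} E\,|\bar X_j - n q_{mj}| \le \frac{1}{n}\sum_{j=1}^{m} \sqrt{n q_{mj}(1 - q_{mj})} \\ &\le \frac{1}{n}\sqrt{m \sum_{j=1}^{m} n q_{mj}} = \sqrt{\frac{m}{n}} \end{aligned} \tag{5.42} $$

and hence

$$ \int |\hat F_M(x) - \bar F_M(x)|\,dx \stackrel{P}{\to} 0. \tag{5.43} $$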
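The bound $\sqrt{m/n}$ on the expected $L_1$ distance can be illustrated by a small Monte Carlo experiment. This is a sketch under simplifying assumptions that are not taken from the paper: uniform cell probabilities $p_{Mi} = 1/M$, equal group sizes, and hypothetical values of $M$, $n$, $m$, the number of replications and the seed.

```python
import math
import random


def l1_upper_bound(group_counts, q, n):
    """(1/n) * sum_j |Xbar_j - n q_j|, which bounds the L1 distance from above."""
    return sum(abs(xb - n * qj) for xb, qj in zip(group_counts, q)) / n


def simulate_mean_bound(M, n, m, reps, seed=0):
    """Monte Carlo average of the upper bound for uniform cells and equal groups."""
    rng = random.Random(seed)
    size = M // m                 # equal group size; assumes m divides M
    q = [size / M] * m            # group probabilities for uniform cells
    total = 0.0
    for _ in range(reps):
        group_counts = [0] * m
        for _ in range(n):
            cell = rng.randrange(M)           # uniform cell probabilities 1/M
            group_counts[cell // size] += 1
        total += l1_upper_bound(group_counts, q, n)
    return total / reps


mean_bound = simulate_mean_bound(M=200, n=400, m=10, reps=200)
print(mean_bound, math.sqrt(10 / 400))
```

The simulated average stays below $\sqrt{m/n}$, with some slack because $E|Z| \le \sqrt{\text{Var}\,Z}$ is not sharp; the bound tends to zero exactly when $m/n \to 0$.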



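In order to prove $\hat F_M \stackrel{w}{\to} F$ in probability, by Proposition 4.1 it remains to show $\bar F_M \stackrel{w}{\to} F$.
Consider the function

$$ \bar g_M(u) = \sum_{j=1}^{m} \frac{1}{k_j - k_{j-1}} \sum_{i=k_{j-1}+1}^{k_j} M p_{Mi}\, 1_{(\frac{k_{j-1}}{M}, \frac{k_j}{M}]}(u). \tag{5.44} $$

For $k_{j-1}/M < u \le k_j/M$ we have

$$ \begin{aligned} |\bar g_M(u) - g(u)| &\le \frac{1}{k_j - k_{j-1}} \sum_{i=k_{j-1}+1}^{k_j} |M p_{Mi} - g(u)| \\ &\le \sup_{k_{j-1}/M < v \le k_j/M} |g_M(v) - g(u)| \\ &\le \sup_{v} |g_M(v) - g(v)| + \sup_{|u - v| \le \sup_j (k_j - k_{j-1})/M} |g(v) - g(u)|. \end{aligned} $$

By assumption, the function $g$ is uniformly continuous and hence $\sup_j (k_j - k_{j-1})/M \to 0$
implies $\bar g_M(U) \to g(U)$, almost surely, and in distribution, i.e. $\bar F_M \stackrel{w}{\to} F$, which completes the
proof of the theorem. □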

5.3 The kernel type estimator 

Weak consistency of the kernel type estimator is established by the next theorem. 

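Theorem 5.2 If $k \to \infty$, $k/M \to 0$, $M/(nk) \to 0$ hold, if $w$ is a density that is Riemann
integrable on bounded intervals, that is also Riemann square integrable on bounded intervals,
and that has bounded support or is ultimately monotone in its tails, and if

$$ \sup_{0<u<1} |g_M(u) - g(u)| \to 0 \tag{5.45} $$

holds with $g$ continuous on $[0, 1]$, then

$$ \hat F_M \stackrel{w}{\to} F, \quad \text{in probability}, \tag{5.46} $$

is valid for

$$ \hat F_M(x) = \frac{1}{M}\sum_{j=1}^{M} 1_{[\frac{M}{nk}\sum_{i=1}^{M} w(\frac{j-i}{k}) X_i \le x]}. \tag{5.47} $$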

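Proof Let

$$ \tilde F_M(x) = \frac{1}{M}\sum_{j=1}^{M} 1_{[\frac{M}{nk}\sum_{i=1}^{M} w(\frac{j-i}{k}) Y_i \le x]} \tag{5.48} $$

be the Poissonized version of $\hat F_M(x)$. Note that by the coupling argument $X_i \ge Y_i$ for all $i$ or
$X_i \le Y_i$ for all $i$. Since $w$ is a Riemann integrable density we thus get

$$ \begin{aligned} E\int |\hat F_M(x) - \tilde F_M(x)|\,dx &\le \frac{1}{M}\sum_{j=1}^{M} E\,\Big| \frac{M}{nk}\sum_{i=1}^{M} w\Big(\frac{j-i}{k}\Big)(X_i - Y_i) \Big| \\ &= E\,\frac{1}{M}\sum_{j=1}^{M} \frac{M}{nk}\sum_{i=1}^{M} w\Big(\frac{j-i}{k}\Big)|X_i - Y_i| \\ &\le E\,\frac{1}{n}\sum_{i=1}^{M} |X_i - Y_i| \sum_{\ell \in \mathbb{Z}} \frac{1}{k} w\Big(\frac{\ell}{k}\Big) \\ &= \frac{E\,|N - n|}{n} \sum_{\ell \in \mathbb{Z}} \frac{1}{k} w\Big(\frac{\ell}{k}\Big) = O\Big(\frac{1}{\sqrt{n}}\Big) \to 0. \end{aligned} $$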



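Consequently, by Proposition 4.1 it suffices to prove

$$ \tilde F_M \stackrel{w}{\to} F, \quad \text{in probability}. \tag{5.49} $$

Define

$$ \bar F_M(x) = \frac{1}{M}\sum_{j=1}^{M} 1_{[\frac{M}{k}\sum_{i=1}^{M} w(\frac{j-i}{k}) p_{Mi} \le x]}. \tag{5.50} $$

To prove (5.49), by Proposition 4.1, it suffices to prove

$$ E\int |\tilde F_M(x) - \bar F_M(x)|\,dx \to 0 \quad \text{and} \quad \bar F_M \stackrel{w}{\to} F. \tag{5.51} $$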

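Indeed, since the $Y_i$ are independent and $w$ is square Riemann integrable, we have

$$ \begin{aligned} E\int |\tilde F_M(x) - \bar F_M(x)|\,dx &\le \frac{1}{M}\sum_{j=1}^{M} E\,\Big| \frac{M}{nk}\sum_{i=1}^{M} w\Big(\frac{j-i}{k}\Big)(Y_i - n p_{Mi}) \Big| \\ &\le \frac{1}{M}\sum_{j=1}^{M} \sqrt{\text{Var}\Big( \frac{M}{nk}\sum_{i=1}^{M} w\Big(\frac{j-i}{k}\Big)(Y_i - n p_{Mi}) \Big)} \\ &\le \sqrt{\frac{1}{M}\sum_{j=1}^{M} \frac{M^2}{n k^2}\sum_{i=1}^{M} w^2\Big(\frac{j-i}{k}\Big) p_{Mi}} \\ &\le \sqrt{\frac{M}{nk}\sum_{\ell \in \mathbb{Z}} \frac{1}{k} w^2\Big(\frac{\ell}{k}\Big)} = O\Big(\sqrt{\frac{M}{nk}}\Big) \to 0, \end{aligned} $$

because of $k \to \infty$ and $M/(nk) \to 0$. This proves the first statement of (5.51).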

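Finally, we prove the second statement of (5.51). As parent density for the distribution
function $\bar F_M$ we choose

$$ \bar g_M(u) = \sum_{j=1}^{M}\sum_{i=1}^{M} \frac{1}{k} w\Big(\frac{j-i}{k}\Big) M p_{Mi}\, 1_{(\frac{j-1}{M}, \frac{j}{M}]}(u). \tag{5.52} $$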
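Note that $\bar g_M$ vanishes outside $(0, 1]$. Fix $u \in (0, 1)$. For $u \in (\frac{j-1}{M}, \frac{j}{M}]$, and $K > 0$ fixed, we
have

$$ \begin{aligned} |\bar g_M(u) - g(u)| \le{} & \sum_{|\ell| \le Kk} \frac{1}{k} w\Big(\frac{\ell}{k}\Big) \Big| g_M\Big(\frac{j - \ell}{M}\Big) - g(u) \Big| + \sum_{|\ell| > Kk,\ \ell \in \mathbb{Z}} \frac{1}{k} w\Big(\frac{\ell}{k}\Big)\, g_M\Big(\frac{j - \ell}{M}\Big) \\ & + \Big| \sum_{|\ell| \le Kk} \frac{1}{k} w\Big(\frac{\ell}{k}\Big) - 1 \Big| \sup_u g(u). \end{aligned} $$

Note that the conditions imposed on $w$ guarantee that

$$ \sum_{|\ell| > Kk,\ \ell \in \mathbb{Z}} \frac{1}{k} w\Big(\frac{\ell}{k}\Big) \tag{5.53} $$

is arbitrarily small for $K$ sufficiently large, that

$$ \sum_{|\ell| \le Kk} \frac{1}{k} w\Big(\frac{\ell}{k}\Big) \to \int_{-K}^{K} w(x)\,dx, \tag{5.54} $$

which is arbitrarily close to one for $K$ large enough, and hence that

$$ \sum_{\ell \in \mathbb{Z}} \frac{1}{k} w\Big(\frac{\ell}{k}\Big) \to 1, \tag{5.55} $$

as $k \to \infty$. Consequently, in view of (5.45), and in view of the uniform continuity and
boundedness of $g$, all three terms at the right hand side tend to zero as $k \to \infty$ and subsequently
$K \to \infty$. So, $\bar g_M(U) \to g(U)$, almost surely and in distribution, which implies $\bar F_M \stackrel{w}{\to} F$. □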



6 Discussion 

The key assumption in the consistency proofs of the grouping and kernel estimators is the
existence of a limiting parent density. This is a reasonable assumption only if there is a natural
ordering of the cells and neighboring cells have approximately the same cell probabilities. In
applications such as linguistics this need not be the case. Consider a text of $n$ words by an
author with a vocabulary of $M$ words. Here the words in the vocabulary correspond to the
cells of the multinomial distribution and the existence of a limiting or approximating parent
density is rather unrealistic. To a lesser extent this might be the case in biology, where cells
correspond to species and $n$ is the number of individuals found in some ecological entity.

An estimator that is consistent even if our key assumption does not hold has been
constructed in Klaassen and Mnatsakanov (2000). However, it seems to have a logarithmic rate of
convergence only. The rates of convergence of our grouping and kernel estimators will depend
on the rate at which the assumed limiting parent density can be estimated. This issue is still
to be investigated, but under the assumption $n/M \to \lambda$, for some constant $\lambda$, Van Es and
Kolios (2002) show that, for the relatively simple case of equal group size, an algebraic rate of
convergence can be achieved by the estimator based on grouping.

Since the estimators studied here are based on smoothing of the cell frequencies an important 
open problem is the choice of the smoothing parameter. For the estimator based on grouping 
this is the choice of the sizes of the groups and for the kernel type estimator the choice of the 
bandwidth. By studying convergence rates these choices may be optimized. 